    A framework for research into continental ancestry groups of the UK Biobank

    BACKGROUND: The UK Biobank is a large prospective cohort, based in the UK, that has deep phenotypic and genomic data on roughly half a million individuals. Included in this resource are data on approximately 78,000 individuals with “non-white British ancestry.” While most epidemiology studies have focused predominantly on populations of European ancestry, there is an opportunity to contribute to the study of health and disease for a broader segment of the population by making use of the UK Biobank’s “non-white British ancestry” samples. Here, we present an empirical description of the continental ancestry and population structure among the individuals in this UK Biobank subset. RESULTS: Reference populations from the 1000 Genomes Project for Africa, Europe, East Asia, and South Asia were used to estimate ancestry for each individual. Those with at least 80% ancestry in one of these four continental ancestry groups were taken forward (N = 62,484). Principal component and K-means clustering analyses were used to identify and characterize population structure within each ancestry group. Of the approximately 78,000 individuals in the UK Biobank that are of “non-white British” ancestry, 50,685, 6,653, 2,782, and 2,364 individuals were assigned to the European, African, South Asian, and East Asian continental ancestry groups, respectively. Each continental ancestry group exhibits prominent population structure that is consistent with self-reported country of birth data and geography. CONCLUSIONS: Methods outlined here provide an avenue to leverage UK Biobank’s deeply phenotyped data, allowing researchers to maximize its potential in the study of health and disease in individuals of non-white British ancestry. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40246-022-00380-5.
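
The 80% filtering step described in this abstract can be illustrated with a minimal sketch. It assumes per-individual ancestry fractions have already been estimated. The group labels, sample IDs, and the function name `assign_group` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional

# Minimum ancestry fraction for an individual to be taken forward,
# as stated in the abstract (at least 80% in one continental group).
THRESHOLD = 0.80


def assign_group(fractions: Dict[str, float],
                 threshold: float = THRESHOLD) -> Optional[str]:
    """Return the group with the largest fraction if it meets the
    threshold, otherwise None (the individual is not taken forward)."""
    group, value = max(fractions.items(), key=lambda kv: kv[1])
    return group if value >= threshold else None


# Hypothetical example data: ancestry fractions per individual.
samples = {
    "id1": {"AFR": 0.02, "EUR": 0.95, "EAS": 0.01, "SAS": 0.02},
    "id2": {"AFR": 0.55, "EUR": 0.40, "EAS": 0.00, "SAS": 0.05},
}

assigned = {sid: assign_group(fr) for sid, fr in samples.items()}
# Keep only individuals that reached the threshold in some group.
kept = {sid: g for sid, g in assigned.items() if g is not None}
```

Here `id2` is dropped because no single group reaches the threshold. The principal component and K-means steps would then be run separately within each retained group.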

    Enhanced methods for local ancestry assignment in sequenced admixed individuals.

    Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are computationally demanding, requiring substantial computation time even for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs.

    Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations.

    Background: Estimation of individual ancestry from genetic data is useful for the analysis of disease association studies, understanding human population history and interpreting personal genomic variation. New, computationally efficient methods are needed for ancestry inference that can effectively utilize existing information about allele frequencies associated with different human populations and can work directly with DNA sequence reads. Results: We describe a fast method for estimating the relative contribution of known reference populations to an individual's genetic ancestry. Our method utilizes allele frequencies from the reference populations and individual genotype or sequence data to obtain a maximum likelihood estimate of the global admixture proportions using the BFGS optimization algorithm. It accounts for the uncertainty in genotypes present in sequence data by using genotype likelihoods and does not require individual genotype data from external reference panels. Simulation studies and application of the method to real datasets demonstrate that our method is significantly faster than previous methods and has comparable accuracy. Using data from the 1000 Genomes project, we show that estimates of the genome-wide average ancestry for admixed individuals are consistent between exome sequence data and whole-genome low-coverage sequence data. Finally, we demonstrate that our method can be used to estimate admixture proportions using pooled sequence data, making it a valuable tool for controlling for population stratification in sequencing-based association studies that utilize DNA pooling. Conclusions: Our method is an efficient and versatile tool for estimating ancestry from DNA sequence data and is available from https://sites.google.com/site/vibansal/software/iAdmix
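
As a rough illustration of estimating global admixture proportions from known reference allele frequencies, the sketch below maximizes the same kind of likelihood with a simple EM update. This is a swapped-in technique, not the BFGS optimizer or the genotype-likelihood model the abstract describes: it assumes hard genotype calls (0, 1, or 2 alternate alleles), frequencies strictly between 0 and 1, and a hypothetical function name `estimate_admixture`.

```python
from typing import List, Sequence


def estimate_admixture(genotypes: Sequence[int],
                       freqs: Sequence[Sequence[float]],
                       n_iter: int = 200) -> List[float]:
    """EM estimate of admixture proportions for one individual.

    genotypes[j] is the alternate-allele count (0, 1, 2) at SNP j.
    freqs[k][j] is the alternate-allele frequency of reference
    population k at SNP j (assumed strictly between 0 and 1).
    """
    k = len(freqs)
    n_snps = len(genotypes)
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * k
        for j, g in enumerate(genotypes):
            # Probability that a random allele copy is alternate / reference.
            p_alt = sum(q[m] * freqs[m][j] for m in range(k))
            p_ref = 1.0 - p_alt
            for m in range(k):
                # Expected number of allele copies that came from population m.
                if g > 0:
                    new_q[m] += g * q[m] * freqs[m][j] / p_alt
                if g < 2:
                    new_q[m] += (2 - g) * q[m] * (1.0 - freqs[m][j]) / p_ref
        # Each SNP contributes two allele copies in total.
        q = [v / (2.0 * n_snps) for v in new_q]
    return q


# Hypothetical toy data: two reference populations with very different
# frequencies at 50 SNPs.
toy_freqs = [[0.99] * 50, [0.01] * 50]
q_single = estimate_admixture([2] * 50, toy_freqs)  # homozygous alternate everywhere
q_mixed = estimate_admixture([1] * 50, toy_freqs)   # heterozygous everywhere
```

In this toy setup the first individual is assigned almost entirely to the first population, while the second is split evenly. A version working from sequence reads would replace the hard genotype counts with genotype likelihoods.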

    Quantifying the legacy of the Chinese Neolithic on the maternal genetic heritage of Taiwan and Island Southeast Asia

    There has been a long-standing debate concerning the extent to which the spread of Neolithic ceramics and Malay-Polynesian languages in Island Southeast Asia (ISEA) was coupled to an agriculturally driven demic dispersal out of Taiwan 4000 years ago (4 ka). We previously addressed this question using founder analysis of mitochondrial DNA (mtDNA) control-region sequences to identify major lineage clusters most likely to have dispersed from Taiwan into ISEA, proposing that the dispersal had a relatively minor impact on the extant genetic structure of ISEA, and that the role of agriculture in the expansion of the Austronesian languages was therefore likely to have been correspondingly minor. Here we test these conclusions by sequencing whole mtDNAs from across Taiwan and ISEA, using their higher chronological precision to resolve the overall proportion that participated in the “out-of-Taiwan” mid-Holocene dispersal as opposed to earlier, postglacial expansions in the Early Holocene. We show that, in total, about 20% of mtDNA lineages in the modern ISEA pool result from the “out-of-Taiwan” dispersal, with most of the remainder signifying earlier processes, mainly due to sea-level rises after the Last Glacial Maximum. Notably, we show that every one of these founder clusters previously entered Taiwan from China, 6–7 ka, where rice-farming originated, and remained distinct from the indigenous Taiwanese population until after the subsequent dispersal into ISEA.

    Assortative human pair-bonding for partner ancestry and allelic variation of the dopamine receptor D4 (DRD4) gene

    The 7R allele of the dopamine receptor D4 gene has been associated with attention-deficit hyperactivity disorder and risk taking. On the cross-population scale, 7R allele frequencies have been shown to be higher in populations with more of a history of long-term migrations. It has also been shown that the 7R allele is associated with individuals having multiple ancestries. Here we conduct a replication of this latter finding with two independent samples. Measures of subjects’ ancestry are used to examine past reproductive bonds. The individuals’ history of inter-racial/ancestral dating and their feelings about this are also assessed. Tentative support for an association between multiple ancestries and the 7R allele was found. These results are dependent upon the method of questioning subjects about their ancestries. Inter-racial dating and feelings about inter-racial pairing were not related to the presence of the 7R allele. This might be accounted for by secular trends that might have substantively altered the decision-making process employed when considering relationships with individuals from different groups. This study provides continued support for the 7R allele playing a role in migration and/or mate choice patterns. However, replications and extensions of this study are needed and must carefully consider how ancestry/race is assessed.

    Ancient pigs reveal a near-complete genomic turnover following their introduction to Europe

    Archaeological evidence indicates that pig domestication had begun by ∼10,500 y before the present (BP) in the Near East, and mitochondrial DNA (mtDNA) suggests that pigs arrived in Europe alongside farmers ∼8,500 y BP. A few thousand years after the introduction of Near Eastern pigs into Europe, however, their characteristic mtDNA signature disappeared and was replaced by haplotypes associated with European wild boars. This turnover could be accounted for by substantial gene flow from local European wild boars, although it is also possible that European wild boars were domesticated independently without any genetic contribution from the Near East. To test these hypotheses, we obtained mtDNA sequences from 2,099 modern and ancient pig samples and 63 nuclear ancient genomes from Near Eastern and European pigs. Our analyses revealed that European domestic pigs dating from 7,100 to 6,000 y BP possessed both Near Eastern and European nuclear ancestry, while later pigs possessed no more than 4% Near Eastern ancestry, indicating that gene flow from European wild boars resulted in a near-complete disappearance of Near Eastern ancestry. In addition, we demonstrate that a variant at a locus encoding black coat color likely originated in the Near East and persisted in European pigs. Altogether, our results indicate that while pigs were not independently domesticated in Europe, the vast majority of human-mediated selection over the past 5,000 y focused on the genomic fraction derived from the European wild boars, and not on the fraction that was selected by early Neolithic farmers over the first 2,500 y of the domestication process.

    Improved Imputation of Common and Uncommon Single Nucleotide Polymorphisms (SNPs) with a New Reference Set

    Statistical imputation of genotype data is an important technique for analysis of genome-wide association studies (GWAS). We have built a reference dataset to improve imputation accuracy for studies of individuals of primarily European descent using genotype data from the Hap1, Omni1, and Omni2.5 human SNP arrays (Illumina). Our dataset contains 2.5-3.1 million variants for 930 European, 157 Asian, and 162 African/African-American individuals. Imputation accuracy of European data from Hap660 or OmniExpress array content, measured by the proportion of variants imputed with R² > 0.8, improved by 34%, 23% and 12% for variants with MAF of 3%, 5% and 10%, respectively, compared to imputation using publicly available data from the 1000 Genomes and International HapMap projects. The improved accuracy with the use of the new dataset could increase the power for GWAS by as much as 8% relative to genotyping all variants. This reference dataset is available to the scientific community through the NCBI dbGaP portal. Future versions will include additional genotype data as well as non-European populations.
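
The accuracy metric mentioned in this abstract, the proportion of variants imputed with R² above 0.8, can be sketched as follows. The abstract does not say how R² was computed, so this sketch assumes it is the squared Pearson correlation between true genotypes and imputed dosages across individuals. The function names and toy data are hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences.
    Returns 0.0 when either sequence has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def proportion_well_imputed(true_genotypes: Sequence[Sequence[float]],
                            imputed_dosages: Sequence[Sequence[float]],
                            cutoff: float = 0.8) -> float:
    """Fraction of variants whose R^2 exceeds the cutoff.
    Each inner sequence holds one variant's values across individuals."""
    scores = [r_squared(t, d) for t, d in zip(true_genotypes, imputed_dosages)]
    return sum(s > cutoff for s in scores) / len(scores)


# Hypothetical toy data: two variants typed in four individuals.
truth = [[0, 1, 2, 1], [0, 1, 2, 1]]
imputed = [[0.1, 1.0, 1.9, 1.1],   # closely tracks the truth
           [1.0, 1.0, 1.0, 1.0]]   # uninformative constant dosage
share = proportion_well_imputed(truth, imputed)
```

In the toy data only the first variant passes the cutoff, so half of the variants count as well imputed.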